-
Notifications
You must be signed in to change notification settings - Fork 403
srv_resolver: add 15s timeout to DNS lookups #19026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
5059cd5
to
aaacc87
Compare
raise e | ||
except defer.TimeoutError as e: | ||
raise defer.TimeoutError( | ||
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)" | |
f"Timed out while trying to resolve DNS for SRV record for {service_name!r} (timeout=15s)" |
Does this sound better?
Only real nit is this previously said 50s total
vs our new 15s
timeout. Ideally, we'd have a constant to use here but I'm not sure that moving timeout=(1, 3, 3, 3, 5)
to the top as a constant is better. The comments probably read better in place where it's used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say to make it a constant, then its easy to do the following:
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)" | |
f"Timed out while trying to resolve DNS for SRV record for {service_name!r} (timeout={sum(LOOKUP_TIMEOUTS)}s)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like Jonathan's suggestion of using a constant, I've gone ahead and applied that suggestion :)
return list(cache_entry) | ||
else: | ||
raise e | ||
except defer.TimeoutError as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have a test that stressed this part of the code. Especially since I'm unsure if defer.TimeoutError
is actually the exception type raised here.
Do you think you would be up for that? Probably involves reactor.advance(15 + 1)
to advance time past the timeout. Otherwise, I can take a stab at it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
15 seconds may be short, but I think that if a server isn't responding inbetween those multi-second retries, its having other issues anyways.
LGTM
Co-authored-by: Jonathan de Jong <[email protected]>
synapse.http.federation.srv_resolver.SrvResolver.resolve_service
isn't able to "timeout" properly, and thus stalls federation #9774Problem
The default timeout when looking up SRV records for the dns client is 60 seconds, which is quite long and can be problematic as that's the value of the default http client timeout, thus the error surfaced is the http error and not the underlying DNS error.
Solution
I've added a more aggressive timeout of 15 seconds total (with retries after 1, 3, 3, 3, and 5 seconds) for the SRV lookup. I've also re-labeled the `defer.TimeoutError, as that was done in the prior synapse PR, and makes the error more clear to users.
Pull Request Checklist
EventStore
toEventWorkerStore
.".code blocks
.